Exploratory Data Analysis
Flight Delays and Cancellations from the Bureau of Transportation Statistics
Dataset compiled by Patrick Zelazko.
- This is a large dataset with 3 million observations, each a specific flight, and 32 features. The data covers flights within the United States from 2019 through 2023. Diverted and cancelled flights are recorded, as are the delay durations in minutes and the attributed reasons for delay.
The variables in this dataset are defined as follows.
| Header | Description |
|---|---|
| Fl Date | Flight Date (yyyy-mm-dd) |
| Airline | Airline Name |
| Airline DOT | Airline Name and Unique Carrier Code. When the same code has been used by multiple carriers, a numeric suffix is used for earlier users, for example, PA, PA(1), PA(2). Use this field for analysis across a range of years. |
| Airline Code | Unique Carrier Code |
| DOT Code | An identification number assigned by US DOT to identify a unique airline (carrier). A unique airline (carrier) is defined as one holding and reporting under the same DOT certificate regardless of its Code, Name, or holding company/corporation. |
| Fl Number | Flight Number |
| Origin | Origin Airport, Airport ID. An identification number assigned by US DOT to identify a unique airport. Use this field for airport analysis across a range of years because an airport can change its airport code and airport codes can be reused. |
| Origin City | Origin City Name, State Code |
| Dest | Destination Airport, Airport ID. An identification number assigned by US DOT to identify a unique airport. Use this field for airport analysis across a range of years because an airport can change its airport code and airport codes can be reused. |
| Dest City | Destination City Name, State Code |
| CRS Dep Time | CRS Departure Time (local time: hhmm) |
| Dep Time | Actual Departure Time (local time: hhmm) |
| Dep Delay | Difference in minutes between scheduled and actual departure time. Early departures show negative numbers. |
| Taxi Out | Taxi Out Time, in Minutes |
| Wheels Off | Wheels Off Time (local time: hhmm) |
| Wheels On | Wheels On Time (local time: hhmm) |
| Taxi In | Taxi In Time, in Minutes |
| CRS Arr Time | CRS Arrival Time (local time: hhmm) |
| Arr Time | Actual Arrival Time (local time: hhmm) |
| Arr Delay | Difference in minutes between scheduled and actual arrival time. Early arrivals show negative numbers. |
| Cancelled | Cancelled Flight Indicator (1=Yes) |
| Cancellation Code | Specifies The Reason For Cancellation |
| Diverted | Diverted Flight Indicator (1=Yes) |
| CRS Elapsed Time | CRS Elapsed Time of Flight, in Minutes |
| Actual Elapsed Time | Elapsed Time of Flight, in Minutes |
| Air Time | Flight Time, in Minutes |
| Distance | Distance between airports (miles) |
| Carrier Delay | Carrier Delay, in Minutes |
| Weather Delay | Weather Delay, in Minutes |
| NAS Delay | National Air System Delay, in Minutes |
| Security Delay | Security Delay, in Minutes |
| Late Aircraft Delay | Late Aircraft Delay, in Minutes |
Table 1 summarizes the dataset by flight period.
Flight Delay Summary by Flight Period

| Statistic | Morning | Afternoon | Evening | Total |
|---|---|---|---|---|
| TotalFlightsCount | 1246031 (41.5%) | 1423140 (47.4%) | 330829 (11.0%) | 3000000 (100%) |
| CancelledFlightsCount | 30690 (38.8%) | 38343 (48.4%) | 10107 (12.8%) | 79140 (100%) |
| DivertedFlightsCount | 2555 (36.2%) | 3901 (55.3%) | 600 (8.5%) | 7056 (100%) |
| AvgCRSDepTime | 08:49:31 | 15:73:19 | 20:66:23 | 13:27:04 |
| AvgDepTime | 08:53:58 | 15:89:05 | 20:12:40 | 13:29:47 |
| AvgDepDelay | 5.23 | 12.93 | 16.51 | 10.12 |
| AvgTaxiOut | 16.87 | 16.44 | 16.65 | 16.64 |
| AvgTaxiIn | 7.75 | 7.78 | 6.95 | 7.68 |
| AvgCRSArrTime | 10:87:15 | 17:85:11 | 17:42:14 | 14:90:34 |
| AvgArrTime | 10:86:01 | 17:71:56 | 15:89:47 | 14:66:31 |
| AvgArrDelay | -0.77 | 7.34 | 10.04 | 4.26 |
| AvgAirTime | 114.12 | 109.8 | 116.31 | 112.31 |
| CarrierDelayCount | 86824 (29.2%) | 162266 (54.6%) | 47861 (16.1%) | 296951 (100%) |
| SecurityDelayCount | 887 (32.1%) | 1434 (52.0%) | 438 (15.9%) | 2759 (100%) |
| WeatherDelayCount | 8380 (26.7%) | 18758 (59.7%) | 4290 (13.7%) | 31428 (100%) |
| NASDelayCount | 80604 (31.4%) | 144366 (56.3%) | 31507 (12.3%) | 256477 (100%) |
| LateAircraftDelayCount | 42721 (16.5%) | 168902 (65.2%) | 47391 (18.3%) | 259014 (100%) |

Table 1: Summary includes morning, afternoon, and evening flight periods.
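Several averaged times in Table 1 have minute fields above 59 (for example, 15:73:19), which is what one would expect if the hhmm values were averaged as plain integers. The following is a minimal sketch of one alternative: convert each hhmm value to minutes since midnight before averaging. The helper names `hhmm_to_minutes` and `mean_clock_time` and the sample values are hypothetical, not part of the original analysis.

```r
# Convert hhmm-encoded integers (e.g. 1550 for 15:50) to minutes since midnight.
hhmm_to_minutes <- function(hhmm) {
  (hhmm %/% 100) * 60 + (hhmm %% 100)
}

# Average clock times in minutes, then format the result as "HH:MM".
mean_clock_time <- function(hhmm) {
  avg <- as.integer(round(mean(hhmm_to_minutes(hhmm), na.rm = TRUE)))
  sprintf("%02d:%02d", avg %/% 60L, avg %% 60L)
}

# Hypothetical departure times: 15:50 and 16:10.
dep_times <- c(1550, 1610)

naive_mean <- mean(dep_times)            # 1580, which would print as "15:80"
clock_mean <- mean_clock_time(dep_times) # "16:00"

print(naive_mean)
print(clock_mean)
```

This sketch still treats times as points on a line, so it does not handle periods that wrap past midnight (as the Evening period may); those would need a shift of the origin or a circular mean.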
Each of the three flight periods spans eight hours: Morning covers departure times from 4 a.m. to noon, followed by Afternoon and then Evening. The Afternoon period contains the most flights (47.4%), followed closely by the Morning period (41.5%), while the Evening period trails both (11.0%). The table also gives the mean departure and arrival times, which indicate where flights are concentrated within each period. Several of these means have minute fields above 59 (for example, 15:73:19), which suggests they are averages of the raw hhmm values and should not be read as clock times. The average departure and arrival delays are lowest for the Morning period (5.23 and -0.77 minutes, respectively) and increase through the Afternoon and Evening periods. The delay counts by type show that the Afternoon and Morning periods account for most of the total delays, although these raw counts do not adjust for the much smaller number of Evening flights.
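To adjust for the different number of flights in each period, the delay counts from Table 1 can be divided by the period's total flight count. The sketch below hard-codes the counts from Table 1; the object names are illustrative only.

```r
# Total flights per period, copied from Table 1.
total_flights <- c(Morning = 1246031, Afternoon = 1423140, Evening = 330829)

# Delay counts by type and period, copied from Table 1.
delay_counts <- rbind(
  Carrier      = c(86824, 162266, 47861),
  Security     = c(887, 1434, 438),
  Weather      = c(8380, 18758, 4290),
  NAS          = c(80604, 144366, 31507),
  LateAircraft = c(42721, 168902, 47391)
)
colnames(delay_counts) <- names(total_flights)

# Divide each column by that period's flight count to get a percentage rate.
delay_rate_pct <- round(100 * sweep(delay_counts, 2, total_flights, "/"), 2)
print(delay_rate_pct)
```

In this calculation the Morning period has the lowest rate for every delay type, and the Evening period has the highest carrier and late-aircraft delay rates even though its raw counts are the smallest.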
Some Visualizations of the Dataset
- These histograms show the distributions of air time, arrival delay, and departure delay. The y-axis was transformed to make the plots more legible. All three are right-skewed. For air time this is expected, given the higher proportion of regional flights and the exclusion of international departures and arrivals. It is also expected that shorter delays (for both arrivals and departures) are more frequent than longer ones.
- This visualization shows the average arrival delay for the five largest airlines (carriers with over 200,000 flights in the given period). The standard deviations are fairly small, indicating low variability in arrival delays for these airlines.
This heat map shows the average arrival delay of flights, grouped by origin airport. The idea is that a flight delayed at departure may also be delayed on arrival at its destination.
Airport location information downloaded from https://github.com/RandomFractals/geo-data-viewer/blob/master/data/csv/usa-airports.csv
Correlation Matrix for Continuous Variables (values rounded to three decimals):

| Variable | DEP_DELAY | TAXI_OUT | TAXI_IN | ARR_DELAY | CRS_ELAPSED_TIME | ELAPSED_TIME | AIR_TIME | DISTANCE |
|---|---|---|---|---|---|---|---|---|
| DEP_DELAY | 1.000 | 0.045 | 0.018 | 0.966 | 0.022 | 0.026 | 0.019 | 0.020 |
| TAXI_OUT | 0.045 | 1.000 | 0.025 | 0.186 | 0.079 | 0.183 | 0.053 | 0.040 |
| TAXI_IN | 0.018 | 0.025 | 1.000 | 0.110 | 0.103 | 0.170 | 0.081 | 0.073 |
| ARR_DELAY | 0.966 | 0.186 | 0.110 | 1.000 | -0.003 | 0.049 | 0.016 | 0.001 |
| CRS_ELAPSED_TIME | 0.022 | 0.079 | 0.103 | -0.003 | 1.000 | 0.982 | 0.989 | 0.983 |
| ELAPSED_TIME | 0.026 | 0.183 | 0.170 | 0.049 | 0.982 | 1.000 | 0.988 | 0.970 |
| AIR_TIME | 0.019 | 0.053 | 0.081 | 0.016 | 0.989 | 0.988 | 1.000 | 0.984 |
| DISTANCE | 0.020 | 0.040 | 0.073 | 0.001 | 0.983 | 0.970 | 0.984 | 1.000 |
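The matrix above is dominated by a few very strong correlations, and a small helper can list them explicitly. The sketch below uses a four-variable subset of the matrix (values rounded to three decimals) and a hypothetical function `strong_pairs` with an arbitrary threshold of 0.9; neither is part of the original analysis.

```r
# Subset of the correlation matrix above, rounded to three decimals.
vars <- c("DEP_DELAY", "ARR_DELAY", "AIR_TIME", "DISTANCE")
cor_sub <- matrix(c(
  1.000, 0.966, 0.019, 0.020,
  0.966, 1.000, 0.016, 0.001,
  0.019, 0.016, 1.000, 0.984,
  0.020, 0.001, 0.984, 1.000
), nrow = 4, byrow = TRUE, dimnames = list(vars, vars))

# Return each variable pair whose absolute correlation meets the threshold.
# Only the upper triangle is scanned so each pair is reported once.
strong_pairs <- function(cm, threshold = 0.9) {
  idx <- which(abs(cm) >= threshold & upper.tri(cm), arr.ind = TRUE)
  data.frame(
    var1 = rownames(cm)[idx[, "row"]],
    var2 = colnames(cm)[idx[, "col"]],
    r    = cm[idx],
    stringsAsFactors = FALSE
  )
}

pairs_found <- strong_pairs(cor_sub)
print(pairs_found)
```

For this subset the helper reports DEP_DELAY with ARR_DELAY and AIR_TIME with DISTANCE, which matches the two groups of near-collinear variables visible in the full matrix.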
Testing between each of the following variables and each continuous variable (DEP_DELAY, TAXI_OUT, TAXI_IN, ARR_DELAY, CRS_ELAPSED_TIME, ELAPSED_TIME, AIR_TIME, DISTANCE):
- Carrier and airports: AIRLINE_CODE, ORIGIN, DEST
- Delay causes: DELAY_DUE_CARRIER, DELAY_DUE_WEATHER, DELAY_DUE_NAS, DELAY_DUE_SECURITY, DELAY_DUE_LATE_AIRCRAFT
- Hour of day: CRS_DEP_HOUR, DEP_HOUR, WHEELS_OFF_HOUR, WHEELS_ON_HOUR, CRS_ARR_HOUR, ARR_HOUR
- Flight period: FLIGHT_PERIOD
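The report does not state which test was run for each pair. As one possible sketch, assuming a Kruskal-Wallis rank-sum test (base R's `kruskal.test`) is appropriate for comparing a continuous variable across groups, a loop over all pairs could look like the following. The data frame `toy`, its simulated values, and the function `test_pairs` are hypothetical stand-ins, not the original data or code.

```r
set.seed(42)

# Simulated stand-in data; the real analysis would use the flight sample.
toy <- data.frame(
  FLIGHT_PERIOD = rep(c("Morning", "Afternoon", "Evening"), each = 50),
  AIRLINE_CODE  = sample(c("AA", "DL", "UA"), 150, replace = TRUE),
  DEP_DELAY     = c(rnorm(50, 5, 10), rnorm(50, 13, 10), rnorm(50, 17, 10)),
  TAXI_OUT      = rnorm(150, 16.6, 5)
)

# Run a Kruskal-Wallis test for every (grouping variable, continuous variable)
# pair and collect the p-values in one data frame.
test_pairs <- function(data, group_vars, value_vars) {
  pairs <- expand.grid(group = group_vars, value = value_vars,
                       stringsAsFactors = FALSE)
  pairs$p_value <- mapply(
    function(g, v) kruskal.test(x = data[[v]], g = factor(data[[g]]))$p.value,
    pairs$group, pairs$value, USE.NAMES = FALSE
  )
  pairs
}

results <- test_pairs(toy,
                      group_vars = c("FLIGHT_PERIOD", "AIRLINE_CODE"),
                      value_vars = c("DEP_DELAY", "TAXI_OUT"))
print(results)
```

Collecting the p-values in a table keeps the output compact. With 120 pairs, a multiple-comparison adjustment such as `p.adjust` would be worth considering, and a rank correlation may suit the numeric delay-cause variables better than a group comparison.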
# Load plotting and data-wrangling packages
library(ggplot2)
library(dplyr)
library(ggcorrplot)
library(tidyr)

# Correlation matrix of all numeric columns (rows with any NA are dropped)
numeric_columns <- Delays_sample %>% select(where(is.numeric))
cor_matrix <- cor(numeric_columns, use = "complete.obs")
ggcorrplot(cor_matrix,
           method = "square",
           type = "upper",
           lab = FALSE,
           title = "Correlation Matrix of Flight Delay Variables",
           colors = c("blue", "white", "red"),
           tl.cex = 6,
           ggtheme = theme_minimal())

# Reshape the delay-cause columns to long format. Keep only rows where the
# given cause contributed minutes and the arrival delay is positive
# (required for the log scale in the boxplot).
df_long <- Delays_sample %>%
  pivot_longer(DELAY_DUE_CARRIER:DELAY_DUE_LATE_AIRCRAFT,
               names_to = "Delay_Type", values_to = "Value") %>%
  filter(!is.na(Value), Value > 0, ARR_DELAY > 0)

# Histogram of arrival delay
ggplot(Delays_sample, aes(x = ARR_DELAY)) +
  geom_histogram(binwidth = 10, fill = "blue", color = "black") +
  ggtitle("Distribution of Arrival Delay") +
  xlab("Arrival Delay (minutes)") +
  ylab("Frequency") +
  xlim(NA, 500)

# Boxplot of arrival delay by delay type. Without the filter above, every
# delay type contained the same flights, so all boxes were averaged together.
ggplot(df_long, aes(x = Delay_Type, y = ARR_DELAY, fill = Delay_Type)) +
  geom_boxplot() +
  # Stats are computed on the log10 scale, so the label is back-transformed
  # and shows the geometric mean.
  stat_summary(fun = mean, geom = "text",
               aes(label = round(10^after_stat(y), 1)),
               vjust = -0.5, color = "red", size = 4) +
  scale_y_log10() +
  ggtitle("Boxplot of Arrival Delay by Delay Type") +
  xlab("Delay Type") +
  ylab("Arrival Delay (minutes)") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))